Discriminative Learning of Visual Data for Audiovisual Speech Recognition
Author
Abstract
In recent years, a number of techniques have been proposed to improve the accuracy and robustness of automatic speech recognition in noisy environments. Among these, supplementing the acoustic information with visual data, mostly extracted from the speaker's lip shapes, has proved successful. We have already demonstrated the effectiveness of integrating visual data at two different levels during speech decoding, according to both direct and separate identification strategies (DI+SI). This paper outlines methods for reinforcing visible speech recognition within the framework of separate identification. First, we define visual-specific units using a self-organizing mapping technique. Second, we complete a stochastic learning of these units with a discriminative neural-network-based technique for speech recognition purposes. Finally, we show on a connected-letter speech recognition task that these methods improve the performance of the DI+SI-based system under varying noise-level conditions.
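As an illustration of the first step, a self-organizing map (SOM) can cluster lip-shape feature vectors into prototype visual units. The sketch below is not the authors' implementation: the feature dimension, map size, and learning schedule are assumptions, written in Python/NumPy.

import numpy as np

# Hypothetical sketch: cluster lip-shape feature vectors into visual units
# with a self-organizing map. Dimensions and schedules are assumptions.
def train_som(features, grid=(8, 8), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    n_nodes = grid[0] * grid[1]
    weights = rng.normal(size=(n_nodes, features.shape[1]))
    # 2-D coordinates of each node, used by the Gaussian neighborhood.
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    total = epochs * len(features)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(features):
            t = step / total
            lr = lr0 * (1.0 - t)              # decaying learning rate
            sigma = sigma0 * (1.0 - t) + 0.5  # shrinking neighborhood radius
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            weights += lr * np.exp(-d2 / (2 * sigma ** 2))[:, None] * (x - weights)
            step += 1
    return weights

def assign_units(features, weights):
    # Each frame is labelled with the index of its nearest prototype.
    d = ((features[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

lip_feats = np.random.default_rng(1).normal(size=(500, 12))  # stand-in features
units = assign_units(lip_feats, train_som(lip_feats))

Each map node then acts as a candidate visual-specific unit; the resulting unit labels are what a discriminative, neural-network-based stage would subsequently be trained on.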
Similar Papers
Analysis of Audio-Visual Features for Unsupervised Speech Recognition
Research on “zero resource” speech processing focuses on learning linguistic information from unannotated, or raw, speech data, in order to bypass the expensive annotations required by current speech recognition systems. While most recent zero-resource work has made use of only speech recordings, here, we investigate the use of visual information as a source of weak supervision, to see whether ...
Visual information and redundancy conveyed by internal articulator dynamics in synthetic audiovisual speech
This paper reports the results of a study investigating the visual information conveyed by the dynamics of internal articulators. The intelligibility of synthetic audiovisual speech with and without visualization of the internal articulator movements was compared. Additionally, speech recognition scores were contrasted before and after a short learning lesson in which articulator trajectories were expla...
Improving lip-reading performance for robust audiovisual speech recognition using DNNs
This paper presents preliminary experiments using the Kaldi toolkit [1] to investigate audiovisual speech recognition (AVSR) in noisy environments using deep neural networks (DNNs). In particular, we use a single-speaker, large-vocabulary, continuous audiovisual speech corpus to compare the performance of visual-only, audio-only and audiovisual speech recognition. The models trained using the Kal...
Speaker adaptation for audio-visual speech recognition
In this paper, speaker adaptation is investigated for audiovisual automatic speech recognition (ASR) using the multistream hidden Markov model (HMM). First, audio-only and visual-only HMM parameters are adapted by combining maximum a posteriori and maximum likelihood linear regression adaptation. Subsequently, the audio-visual HMM stream exponents are adapted to better capture the reliability o... (a sketch of this stream-exponent combination follows the list below)
End-to-end Audiovisual Speech Recognition
Several end-to-end deep learning approaches have recently been presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the ...
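Several of the entries above, in particular the speaker-adaptation work, rest on multistream HMM decoding, in which audio and visual stream log-likelihoods are combined through reliability-dependent stream exponents. Below is a minimal sketch of that combination, assuming a single global exponent; the actual exponent-estimation schemes vary by paper, and all values here are stand-ins.

import numpy as np

# Hypothetical values: per-state log-likelihoods from each stream at one frame.
log_p_audio = np.log(np.array([0.6, 0.3, 0.1]))
log_p_visual = np.log(np.array([0.2, 0.5, 0.3]))

def combine_streams(log_a, log_v, lam=0.7):
    # Exponent-weighted fusion: log p = lam * log p_A + (1 - lam) * log p_V;
    # lam is raised when the audio stream is judged reliable (high SNR).
    return lam * log_a + (1.0 - lam) * log_v

fused = combine_streams(log_p_audio, log_p_visual)
best_state = int(fused.argmax())  # the decoder follows the highest fused score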
Journal: International Journal on Artificial Intelligence Tools
Volume: 8, Issue: -
Pages: -
Published: 1999